ABSTRACT

We focus on multiple repeats of a single amino acid (homorepeats) and
present a new database, HRaP (http://bioinfo.protres.ru/hrap/), of the
occurrence of homorepeats and disordered patterns in different proteomes.
HRaP is aimed at understanding the function of amino acid tandem repeats
in different proteomes. The database includes 122 proteomes, 97 eukaryotic
and 25 bacterial, which represent 9 kingdoms of eukaryotes and 5 phyla of
bacteria. It contains 1 449 561 protein sequences, 771 786 of which have
GO annotations. We have determined homorepeats and patterns that are
associated with particular functions. Through our web server, the user can
(i) search for proteins with a given homorepeat in 122 proteomes,
including the GO annotations of these proteins; (ii) search in 122
proteomes for proteins with a given disordered pattern from the library of
disordered patterns constructed on the clustered Protein Data Bank,
including the GO annotations of these proteins; (iii) analyze the lengths
of homorepeats in different proteomes; (iv) investigate disordered regions
in chosen proteins in 122 proteomes; (v) study the coupling of different
homorepeats in one protein; (vi) determine the longest run of each amino
acid within each proteome; and (vii) download the full list of proteins
with a homorepeat of a given length.

INTRODUCTION 

Motifs with low complexity have been found to occur in eukaryotic
proteomes (including the human one) more frequently than other protein
motifs (1-3). One such motif is a homorepeat, a region in which a single
amino acid is repeated. Homorepeats play important roles in some
biological processes (1,2,4,5). Homorepeats of some amino acids occur more
frequently than those of others, and the prevalent type of homorepeat
varies between proteomes (3). For example, EEEEEE appears to be the most
frequent for Chordata, QQQQQQ for Arthropoda and SSSSSS for Nematoda (3).
One can suggest that such homorepeats serve as molecular recognition
elements for proteins. A growing number of studies suggest that
homorepeats have a broader role in human diseases than was previously
recognized (6). Notably, expansion of homorepeats is the molecular basis
of at least 18 human neurological diseases. For example, expansion of
poly-A in polyadenylate-binding protein 2 is associated with
oculopharyngeal muscular dystrophy (7). Long poly-A tracts cause several
human developmental diseases (5,8,9). Expansion of poly-Q (beyond 36
residues) in the huntingtin gene product results in Huntington's disease.
Moreover, poly-Q tracts are associated with several ataxias (8,10).
Therefore, understanding the functional role of these patterns,
homorepeats in particular, in the proteomes is a formidable challenge.

Given the active study of disordered regions and their functions, we
focus on multiple long repeats of one amino acid (homorepeats) (see
Figure 1). The longest uninterrupted runs in the Dictyostelium discoideum
proteome are 306 residues for serine, 79 for glutamine, 90 for asparagine
and 55 for glutamic acid. The longest uninterrupted runs in the human
proteome are 58 residues for serine, 74 for glutamine, 58 for aspartic
acid and 53 for lysine. It is timely to analyze more carefully the
occurrence, evolution and conservation of these repeats in order to find
their functions. It is still unknown why genetically unstable homorepeats
have been preserved throughout evolution, so it is important to trace the
occurrence of homorepeats in different classes. Recently, the functions of
some such motifs have been determined. For example, histidine repeats in
the protein kinase DYRK1A (length 13) and in the protein FAM76B (length
10) mediate nuclear speckle trafficking (5,11,12). The poly-A tract in the
HOXD13 protein (length 15) is important in limb development (5). It has been



*To whom correspondence should be addressed. Tel: +7 4967 318275; Fax: +7 4967 318435; Email: ogalzit/a/vega.protres.ru 
© The Author(s) 2013. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ 
by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial 
re-use, please contact journals.permissions@oup.com 



D274 Nucleic Acids Research, 2014, Vol. 42, Database issue 



[Figure 1 plot. Panel title: Dictyostelium discoideum (slime mold). Horizontal axis: length of repeats, 0 to 50. Vertical axis: logarithmic scale, 0.1 to 10000.]

Figure 1. Number of proteins that contain homorepeats as a function of homorepeat length for the 20 amino acids in the D. discoideum proteome.



predicted that most homorepeats are disordered (3,13). Homorepeats such
as KKKKK, PPPPP and HHHH are included in the library of disordered
patterns (14). In living organisms, homopeptides can also be of
non-ribosomal origin (2). Comparative analyses of amino acid repeats in
some proteomes have been carried out (2,9,15,16). To gain a clear insight
into the abundance of homorepeats and disordered patterns, we have created
a database of the occurrence of homorepeats of different lengths and of
disordered patterns (HRaP) in 122 eukaryotic and bacterial proteomes. Our
database includes 1 449 561 protein sequences from 122 proteomes, 771 786
sequences of proteins with GO annotations (17), all homorepeats and 412
disordered patterns from three sets (14,18,19).
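
The longest uninterrupted runs quoted above can be computed with a few lines of code. The following is a minimal sketch in Python, not the HRaP implementation: the function names and the toy sequences are hypothetical, and each protein is assumed to be a plain string of one-letter amino acid codes.

```python
from itertools import groupby

def longest_runs(sequence):
    """Return {residue: length of its longest uninterrupted run} for one sequence."""
    best = {}
    for residue, group in groupby(sequence):
        run_length = sum(1 for _ in group)
        if run_length > best.get(residue, 0):
            best[residue] = run_length
    return best

def longest_runs_in_proteome(sequences):
    """Combine per-protein results into the longest run per residue over a proteome."""
    best = {}
    for sequence in sequences:
        for residue, run_length in longest_runs(sequence).items():
            best[residue] = max(best.get(residue, 0), run_length)
    return best

# Toy sequences invented for illustration; they are not taken from any proteome.
toy_proteome = ["M" + "S" * 6 + "K" + "Q" * 3 + "A", "MA" + "Q" * 8 + "EEK"]
runs = longest_runs_in_proteome(toy_proteome)
print(runs["S"], runs["Q"])
```

Grouping consecutive identical residues avoids any fixed minimum length, so the same pass yields the longest run for all 20 amino acids at once.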



DESCRIPTION OF THE DATABASE 

We considered 3617 proteomes from the European Bioinformatics Institute
site (ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/uniprot/proteomes/).
Because the disordered patterns that occur frequently in proteomes have
low complexity (homorepeats), we first performed a preliminary analysis of
homorepeats. Figure 2 shows how the number of proteins with at least one
homorepeat of 6 or more residues depends on proteome size. The occurrence
of homorepeats depends only weakly on the size of the proteome. The
general result of this analysis is that homorepeats appear more often in
eukaryotic proteomes than in other proteomes (bacterial, archaeal and
viral). Figure 2 also shows that the number of proteins with at least one
homorepeat of 6 residues is <100 for proteomes with an overall number of
residues <2 500 000. For this reason, our research involves only proteomes
with an overall number of residues exceeding 2 500 000 (19). Taking
proteome size into account, we obtained 122 proteomes representing nine
kingdoms of eukaryotes and five phyla of bacteria (see the table in the
'proteomes' section of HRaP). Together, these proteomes contain 1 449 561
protein sequences. The possible use of the database (named HRaP) is not
restricted to tasks connected with the investigation of disordered regions
in proteins and proteomes. Disordered regions can be calculated with our
programs IsUnstruct (14,20) and FoldUnfold (21). Recently published
methods for the prediction of disordered regions are usually meta-servers
that combine multiple disorder predictors, e.g. MD (22), PONDR-FIT (23)
and MFDp (24). Separate methods exist for predicting short disordered
regions [<15 residues, PONDR VSL2 (25)] and long disordered regions [>30
residues, PONDR VSL1 (26)]. Our method IsUnstruct demonstrates high
accuracy in predicting both short and long disordered regions.
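
The proteome selection step described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, each proteome is assumed to be a list of plain sequence strings, and the regular expression treats any character repeated six or more times in a row as a homorepeat.

```python
import re

MIN_TOTAL_RESIDUES = 2_500_000             # size cutoff quoted in the text
HOMOREPEAT_6 = re.compile(r"(.)\1{5,}")    # one residue repeated 6 or more times

def proteome_size(sequences):
    """Total number of residues in a proteome."""
    return sum(len(sequence) for sequence in sequences)

def proteins_with_homorepeat(sequences, pattern=HOMOREPEAT_6):
    """Number of proteins with at least one match of the homorepeat pattern."""
    return sum(1 for sequence in sequences if pattern.search(sequence))

def select_proteomes(proteomes, min_residues=MIN_TOTAL_RESIDUES):
    """Keep proteomes whose total residue count exceeds min_residues."""
    return {
        name: sequences
        for name, sequences in proteomes.items()
        if proteome_size(sequences) > min_residues
    }

# Toy data with a tiny cutoff, for illustration only.
toy = {"small": ["MKT"], "large": ["M" + "Q" * 6 + "K", "MAAAAAK"]}
kept = select_proteomes(toy, min_residues=5)
print(sorted(kept), proteins_with_homorepeat(kept["large"]))
```

Dividing `proteins_with_homorepeat` by the number of proteins gives the kind of per-proteome percentage plotted in Figure 4.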

HRaP can be used to analyze evolutionary differences between proteins
from different proteomes and the connections of these regions with
particular functions. The database includes 771 786 proteins with GO
annotations. It has been found that leucine repeats are especially
abundant in the 'Receptor and/or Membrane' group, glutamine and alanine
repeats in the 'Transcription Factor and/or Development' group, and lysine
repeats in the 'Metabolism' group (2,5).

Figure 2. Dependence of the number of proteins with at least one occurrence of homorepeats of ≥6 residues long in 3617 proteomes on the size of proteomes.

To see the occurrence of a homorepeat, the user should first choose one
of the 122 proteomes and then choose the homorepeat of a given length, or
the pattern, to be investigated (see Figure 3). The order of amino acids
and patterns is not random. The amino acids are ordered so that similar
amino acids are grouped together. The table of occurrence of homorepeats
of different lengths shows that long homorepeats appear more often for
polar and charged amino acids. The patterns are ordered according to their
significance for the prediction of disordered regions; their numbers were
assigned in the corresponding articles (14,18,19). The list of proteins
with the given homorepeat or pattern then appears, together with GO
annotations (if any are available). Usually, long proteins contain one
homorepeat or several different homorepeats. If a protein contains several
homorepeats and patterns, all these regions are marked by different colors
in the sequence. In the section HomoRepeats and Patterns, the user can
find the occurrence of homorepeats of different lengths and of disordered
patterns for all 122 proteomes. The largest fraction of proteins with
homorepeats of six or more residues belongs to the Amoebozoa proteome
(D. discoideum), 46% (see Figure 4). The longest uninterrupted runs in the
D. discoideum proteome are 306 residues for serine, 79 for glutamine, 90
for asparagine and 55 for glutamic acid. The most frequent amino acid runs
in the 122 proteomes are poly-Q (6 < tract length < 15), poly-S, poly-A,
poly-G, poly-N, poly-P and poly-E (in decreasing order). For short tracts
(up to 5 residues), the acidic runs poly-E and poly-D are more frequent
than poly-Q and poly-N; this relationship changes for long tracts. The
basic runs poly-K are more frequent than poly-R, and poly-S runs are more
frequent than poly-T, for all lengths of homorepeats.
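
A table of homorepeat occurrence by amino acid and length, of the kind discussed above, might be built as in the following Python sketch. It is illustrative only: `occurrence_table` is a hypothetical name, and counting a protein for every length up to its longest run (that is, "contains a run of at least this length") is an assumption, since the text does not say whether the HRaP tables count exact or minimum lengths.

```python
import re
from collections import Counter

RUN = re.compile(r"(.)\1*")  # a maximal run of one repeated character

def occurrence_table(sequences, max_length=10):
    """Count proteins that contain a run of each residue with length >= L, L = 1..max_length."""
    table = Counter()
    for sequence in sequences:
        longest = {}
        for match in RUN.finditer(sequence):
            residue, length = match.group(1), len(match.group(0))
            longest[residue] = max(longest.get(residue, 0), length)
        # Each protein is counted once per (residue, threshold) pair.
        for residue, length in longest.items():
            for threshold in range(1, min(length, max_length) + 1):
                table[(residue, threshold)] += 1
    return table

# Toy sequences invented for illustration.
toy = ["M" + "K" * 6 + "A", "M" + "R" * 3 + "K" * 4, "M" + "S" * 6 + "T"]
table = occurrence_table(toy)
print(table[("K", 4)], table[("R", 4)])
```

With such a table, comparisons like poly-K versus poly-R at a given length reduce to comparing `table[("K", L)]` with `table[("R", L)]`.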

Homorepeats and patterns associated with the function 

We can suggest that homorepeats and patterns are responsible for common
functions of nonhomologous, unrelated proteins from different organisms.
To test this, we have done the following analysis. All GO annotations of
proteins were collected for the set of 122 proteomes. The number of
different kinds of annotations is 11 313. Proteins without annotations
were combined into the class «absent annotation». We calculated the number
of proteins that include at least one pattern from the latest version of
the library [171 patterns, set 2012 (14)], $N_{pt}$, and likewise the
number of proteins that include homorepeats of 6 or more residues,
$N_{hm}$. The number of proteins with a given annotation was also
calculated and is indicated in the column $N_{go}$. For example, we found
60 proteins that carry the molecular function (F) annotation 'ATP binding'
and contain the pattern IKSHHNVGGLP. The same pattern is also associated
with the guanosine monophosphate (GMP) synthase (glutamine-hydrolyzing)
activity and the GMP biosynthetic process. For each pattern or homorepeat,
we can calculate the frequency of occurrence in all proteins:

$$\pi = \frac{N_{pt}}{N_{all}}$$

where $N_{pt}$ is the number of proteins with the given homorepeat or with
the given pattern, $N_{all}$ = 1 449 561 is the full number of proteins in
122 proteomes and the total number of GO-annotated proteins is 771 786.
The number of proteins with the given homorepeat (or pattern) and
annotation ($N_{pt,go}$) is given in the Table (section GO annotations).
The probability of finding $N_{pt,go}$ or more such proteins among all
proteins with the given annotation was calculated as:

$$p_{\pi} = \sum_{i=N_{pt,go}}^{N_{go}} \frac{N_{go}!}{i!\,(N_{go}-i)!}\,\pi^{i}\,(1-\pi)^{N_{go}-i}$$
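
The probability above is the upper tail of a binomial distribution with success probability equal to the frequency of the pattern in all proteins. A minimal Python sketch of this calculation follows; it is not the authors' code, the function name `binomial_tail` is hypothetical, and the summation is done in log space (via `math.lgamma`) as an implementation choice so that very small probabilities do not underflow term by term.

```python
import math

def binomial_tail(n_go, n_pt_go, pi):
    """P(X >= n_pt_go) for X ~ Binomial(n_go, pi), summed in log space."""
    if n_pt_go <= 0:
        return 1.0
    if n_pt_go > n_go or pi <= 0.0:
        return 0.0
    if pi >= 1.0:
        return 1.0
    log_pi, log_q = math.log(pi), math.log1p(-pi)
    log_terms = [
        math.lgamma(n_go + 1) - math.lgamma(i + 1) - math.lgamma(n_go - i + 1)
        + i * log_pi + (n_go - i) * log_q
        for i in range(n_pt_go, n_go + 1)
    ]
    largest = max(log_terms)
    # Factor out the largest term before exponentiating to keep precision.
    total = math.exp(largest) * sum(math.exp(t - largest) for t in log_terms)
    return min(1.0, total)

# Small check values that can be verified by hand.
print(binomial_tail(3, 2, 0.5))
```

In the notation of the text, the call would be `binomial_tail(N_go, N_pt_go, N_pt / N_all)`; for annotations with very many proteins the explicit loop over all terms is slow but remains correct.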



Figure 4. The percentage of proteins with at least one occurrence of homorepeats of ≥6 residues long in 122 proteomes.

Taking into account 171 patterns, 20 homorepeats and 11 313 kinds of GO
annotations, we have 11 313 × (171 + 20) = 2 160 783 ≈ 2·10^6 possible
combinations. Therefore, we should not pay attention to events whose
probability is higher than $10^{-7}$. Taking this into account, the
probabilities $p_{\pi}$ were colored according to the following
conditions: green corresponds to $p_{\pi} < 10^{-15}$, light green to
$10^{-15} < p_{\pi} < 10^{-10}$ and yellow to
$10^{-10} < p_{\pi} < 10^{-7}$.
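
The count of combinations and the color classes can be written out as a short Python sketch. The names below are hypothetical, and the treatment of values that fall exactly on a boundary is an assumption, because the text gives strict inequalities only.

```python
N_GO_KINDS = 11_313
N_PATTERNS = 171
N_HOMOREPEATS = 20

# Number of (annotation, pattern or homorepeat) combinations tested.
n_combinations = N_GO_KINDS * (N_PATTERNS + N_HOMOREPEATS)

def color_for_probability(p):
    """Map a tail probability to the color classes described in the text."""
    if p < 1e-15:
        return "green"
    if p < 1e-10:
        return "light green"
    if p < 1e-7:
        return "yellow"
    return None  # above the cutoff: not highlighted

print(n_combinations, color_for_probability(1e-12))
```

The reciprocal of the number of combinations is about 4.6·10^-7, so the 10^-7 cutoff is of the same order as one expected chance event over all tested combinations.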
We also calculated the probabilities:

$$p_1 = \ldots \qquad \text{and} \qquad p_2 = \ldots$$



The patterns and homorepeats are sorted by $p_1$ and $p_2$ using the
following colors: green, $p_1 > 0.5$; light green, $0.3 < p_1 < 0.5$; and
light yellow, $0.1 < p_1 < 0.3$. The patterns and homorepeats associated
with functions are presented in the section GO annotations. It is
interesting to note that histidine, alanine, glutamine and glutamic acid
repeats are connected with the GO annotation 'C: nucleus'. As mentioned in
the Introduction, histidine repeats mediate nuclear speckle trafficking in
several transcription factors (5,11,12). The methionine repeat is
connected with voltage-gated calcium channel activity. Proline homorepeats
are associated with many GO annotations: dendrite self-avoidance, central
nervous system morphogenesis, bacterial cell surface binding, axon
guidance receptor activity, axon extension involved in axon guidance,
actin polymerization or depolymerization, Rho GTPase binding, mushroom
body development, actin cortical patch, axonal fasciculation, actin
cytoskeleton organization, peripheral nervous system development, cell
morphogenesis, tropomyosin binding and stereocilium. It should also be
noted that not all amino acid repeats are associated with a function.

Among 109 disordered patterns (set 2010), 8 occur (as exact matches) only
in the Protein Data Bank and are absent from the 122 proteomes. There are
only 6 such patterns among the 141 patterns of set 2011, and 8 among the
171 patterns of set 2012. Patterns such as TTTATT and NNNNN (from set
2012) occur >17 000 times in the 122 proteomes considered. The leader is
QQQQQQQ, which occurs >20 000 times. Moreover, the pattern NNNNN is
connected with the GO process 'symbiosis, encompassing mutualism through
parasitism'. This pattern is rare in the human proteome, where it occurs
in only 21 proteins.

We have created the list of human proteins with 
homorepeats that are associated with disease. The list 
can be found in the frequently asked questions section. 
Also, the list of proteins with homorepeats of 6 and 
more residues long from the clustered Protein Data 
Bank (14) can be found in the frequently asked questions 
section. 

Correlations between number of proteins with homorepeats 
or patterns in any proteome 

For each proteome, we calculated a set of 109 values, each giving the
number of proteins that contain at least one occurrence of the
corresponding disordered pattern from the library of 109 patterns. Then,
for every pair of proteomes, the correlation coefficient between the two
sets of 109 values was calculated, which gives a matrix of correlation
coefficients. The coefficients were then averaged within each kingdom and
phylum (see the Correlations section). Similar values have been calculated
for the sets of 141 disordered patterns, 171 disordered patterns and 20
homorepeats. A comparative analysis of the 122 proteomes, which represent
9 kingdoms of eukaryotes and 5 phyla of bacteria, has demonstrated that
the correlation coefficients between the numbers of proteins containing at
least one homorepeat of 6 or more residues for each of the 20 types of
amino acid residues are higher within a kingdom than between kingdoms (3).
The same result is valid for the 109 selected disordered patterns (set
2010) (18), the 141 selected disordered patterns (set 2011) (19) and the
171 selected disordered patterns (set 2012) (14).
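
The pairwise correlation and averaging procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the Pearson coefficient is assumed (the text says only "correlation coefficient"), the names `pearson` and `mean_correlations` are hypothetical, and every count vector is assumed to be non-constant so that the coefficient is defined.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_correlations(counts, group_of):
    """Average pairwise correlations for each pair of groups.

    counts:   {proteome: [number of proteins per pattern]}
    group_of: {proteome: kingdom or phylum label}
    """
    sums = {}
    for a, b in combinations(sorted(counts), 2):
        key = tuple(sorted((group_of[a], group_of[b])))
        total, n = sums.get(key, (0.0, 0))
        sums[key] = (total + pearson(counts[a], counts[b]), n + 1)
    return {key: total / n for key, (total, n) in sums.items()}

# Toy counts for three hypothetical proteomes and three patterns.
toy_counts = {"p1": [1, 2, 3], "p2": [2, 4, 6], "p3": [3, 2, 1]}
toy_groups = {"p1": "A", "p2": "A", "p3": "B"}
averages = mean_correlations(toy_counts, toy_groups)
print(averages)
```

Keys such as `("A", "A")` hold the within-group average and keys such as `("A", "B")` the between-group average, which is the comparison made in the text.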



CONCLUSIONS AND FUTURE DIRECTIONS 

We have collected an exhaustive database of the occurrence of
homorepeats and patterns in 122 proteomes, each with more than 2 500 000
residues. The patterns and homorepeats found to be associated with
functions point to the considerable importance of homorepeats in a large
variety of cellular processes and merit further study. In future work, we
plan to include an analysis of the coupling between occurrences of
different homorepeats in one protein and to cluster the proteins in order
to exclude the influence of homologous proteins on the determination of
homorepeat functions. We will be grateful for any contribution to the
database from the community.